There are 2 packages you will need to install for today’s practical:
install.packages(c("h2o", "eegkit", "forecast", "tseries")
apart from that everything else should already be available on your
system. However, I will endeavour to use explicit imports to make it
clear where functions are coming from (functions without
library_name::
are part of base R or a function we’ve
defined in this notebook).
knitr::opts_chunk$set(echo = TRUE)
# experimenting with this ML library on my quest to find something pleasant to use in R
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o::h2o.init(nthreads = 1)
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 days 34 minutes
## H2O cluster timezone: America/Halifax
## H2O data parsing timezone: UTC
## H2O cluster version: 3.36.1.2
## H2O cluster version age: 17 days
## H2O cluster name: H2O_started_from_R_fin_smu226
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.38 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 1
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.1.3 (2022-03-10)
# EEG manipulation library in R (although very limited compared to signal processing libraries available in other languages, matlab might actually still be a leader in this specific area)
library(eegkit)
## Loading required package: eegkitdata
## Loading required package: bigsplines
## Loading required package: quadprog
## Loading required package: ica
## Loading required package: rgl
## Loading required package: signal
##
## Attaching package: 'signal'
## The following objects are masked from 'package:stats':
##
## filter, poly
# some time series functions (that we only skim the depths of)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
# just tidyverse libraries that should already be installed
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:signal':
##
## filter
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(purrr)
library(ggplot2)
One of the most common types of medical sensor data (and one that we
talked about during the lecture) are Electroencephalograms (EEGs).
These measure mesoscale electrical signals (measured in microvolts)
within the brain, which are indicative of a region of neuronal activity.
Typically, EEGs involve an array of sensors (aka channels) placed on the
scalp with a high degree of covariance between sensors.
As EEG data can be very large and unwieldy, we are going to use a relatively small/simple dataset today from this paper.
This dataset is a 117 second continuous EEG measurement collected
from a single person with a device called a “Emotiv EEG Neuroheadset”.
In combination with the EEG data collection, a camera was used to record
whether person being recorded had their eyes open or closed. This was
eye status was then manually annotated onto the EEG data with
1
indicated the eyes being closed and 0
the
eyes being open. Measures microvoltages are listed in chronological
order with the first measured value at the top of the dataframe.
Let’s parse the data directly from h2o
’s test data S3
bucket:
eeg_url <- "https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate_splits.csv"
eeg_data <- dplyr::as_tibble(h2o::h2o.importFile(eeg_url))
##
|
| | 0%
|
|============= | 19%
|
|================================================ | 69%
|
|======================================================================| 100%
# add timestamp
Fs <- 117 / dim(eeg_data)[1]
eeg_data <- eeg_data %>% dplyr::mutate(ds=seq(0, 116.99999, by=Fs), eyeDetection=as.factor(eyeDetection))
print(eeg_data %>% dplyr::group_by(eyeDetection) %>% dplyr::count())
## # A tibble: 2 × 2
## # Groups: eyeDetection [2]
## eyeDetection n
## <fct> <int>
## 1 0 8257
## 2 1 6723
# split dataset into train, validate, test
eeg_train <- eeg_data %>% dplyr::filter(split=='train') %>% dplyr::select(-split)
print(eeg_train %>% dplyr::group_by(eyeDetection) %>% dplyr::count())
## # A tibble: 2 × 2
## # Groups: eyeDetection [2]
## eyeDetection n
## <fct> <int>
## 1 0 4916
## 2 1 4072
eeg_validate <- h2o::as.h2o(eeg_data %>% dplyr::filter(split=='valid') %>% dplyr::select(-split))
##
|
| | 0%
|
|======================================================================| 100%
eeg_test <- h2o::as.h2o(eeg_data %>% dplyr::filter(split=='test') %>% dplyr::select(-split))
##
|
| | 0%
|
|======================================================================| 100%
0 Knowing the eeg_data
contains 117
seconds of data, inspect the eeg_data
dataframe and work
out approximately how many samples per second were taken?
1 How many EEG electrodes/sensors were used?
Now that we have the dataset and some basic parameters let’s begin with the ever important/relevant exploratory data analysis.
First we should check there is no missing data!
h2o::h2o.nacnt(h2o::as.h2o(eeg_data))
##
|
| | 0%
|
|======================================================================| 100%
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Great, now we can start generating some plots to look at this data within the time-domain.
melt <- reshape2::melt(eeg_data %>% dplyr::select(-split), id.vars=c("eyeDetection", "ds"), variable.name = "Electrode", value.name = "microvolts")
ggplot2::ggplot(melt, ggplot2::aes(x=ds, y=microvolts, color=Electrode)) +
ggplot2::geom_line() +
ggplot2::ylim(3500,5000) +
ggplot2::geom_vline(ggplot2::aes(xintercept=ds), data=dplyr::filter(melt, eyeDetection==1), alpha=0.005)
2 Do you see any obvious patterns between eyes being open (dark grey blocks in the plot) and the EEG intensities?
3 Similarly, based on the distribution of eye open/close state over time to anticipate any temporal correlation between these states?
Let’s see if we can directly look at the distribution of EEG intensities and see how they related to eye status.
melt_train <- reshape2::melt(eeg_train, id.vars=c("eyeDetection", "ds"), variable.name = "Electrode", value.name = "microvolts")
# filter huge outliers in voltage
filt_melt_train <- dplyr::filter(melt_train, microvolts %in% (3750:5000)) %>% dplyr::mutate(eyeDetection=as.factor(eyeDetection))
ggplot2::ggplot(filt_melt_train, ggplot2::aes(y=Electrode, x=microvolts, fill=eyeDetection)) + ggplot2::geom_boxplot()
Plots are great but sometimes so it is also useful to directly look at the summary statistics and how they related to eye status:
filt_melt_train %>% dplyr::group_by(eyeDetection, Electrode) %>%
dplyr::summarise(mean = mean(microvolts), median=median(microvolts), sd=sd(microvolts)) %>%
dplyr::arrange(Electrode)
## `summarise()` has grouped output by 'eyeDetection'. You can override using the
## `.groups` argument.
## # A tibble: 28 × 5
## # Groups: eyeDetection [2]
## eyeDetection Electrode mean median sd
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 0 AF3 4294. 4300 35.4
## 2 1 AF3 4305. 4300 34.4
## 3 0 F7 4015. 4020 28.4
## 4 1 F7 4007. 4000 24.9
## 5 0 F3 4268. 4260 20.9
## 6 1 F3 4269. 4260 17.4
## 7 0 FC5 4124. 4120 17.3
## 8 1 FC5 4124. 4120 19.2
## 9 0 T7 4341. 4340 13.9
## 10 1 T7 4342. 4340 15.5
## # … with 18 more rows
4 Based on these analyses are any electrodes consistently more intense or varied when eyes are open?
We can also explore the data in frequency space by using a Fast
Fourier Transform.
After the FFT we can summarise the distributions of frequencies by their
density across the power spectrum. This will let us see if there any
obvious patterns related to eye status in the overall frequency
distributions.
eegkit::eegpsd(eeg_train %>% dplyr::filter(eyeDetection == 0) %>% dplyr::select(-eyeDetection, -ds), Fs = Fs, xlab="Eye Open")
eegkit::eegpsd(eeg_train %>% dplyr::filter(eyeDetection == 1) %>% dplyr::select(-eyeDetection, -ds), Fs = Fs, xlab="Eye Closed")
8 Do you see any differences between the power spectral densities for the two eye states? If so, describe them.
We may also wish to explore whether there are multiple sources of
neuronal activity being picked up by the sensors.
This can be achieved using a process known as independent component
analysis (ICA) which decorrelates the channels and identifies the
primary sources of signal within the decorrelated matrix.
ica <- eegkit::eegica(eeg_train %>% dplyr::select(-eyeDetection, -ds), nc=3, method='fast', type='time')
mix <- dplyr::as_tibble(ica$M)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
mix$eyeDetection <- eeg_train$eyeDetection
mix$ds <- eeg_train$ds
mix_melt <- reshape2::melt(mix, id.vars=c("eyeDetection", "ds"), variable.name = "Independent Component", value.name = "M")
ggplot2::ggplot(mix_melt, ggplot2::aes(x=ds, y=M, color=`Independent Component`)) +
ggplot2::geom_line() +
ggplot2::geom_vline(ggplot2::aes(xintercept=ds), data=dplyr::filter(mix_melt, eyeDetection==1), alpha=0.005) +
ggplot2::scale_y_log10()
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 8987 row(s) containing missing values (geom_path).
9 Does this suggest eye activate forms an independent component of activity across the electrodes?
Now that we’ve explored the data let’s use a simple model to see how well we can predict eye status from the EEGs:
model <- h2o::h2o.gbm(x = colnames(dplyr::select(eeg_train, -eyeDetection, -ds)),
y = colnames(dplyr::select(eeg_train, eyeDetection)),
training_frame = h2o::as.h2o(eeg_train),
validation_frame = eeg_validate,
distribution = "bernoulli",
ntrees = 100,
max_depth = 4,
learn_rate = 0.1)
##
|
| | 0%
|
|======================================================================| 100%
##
|
| | 0%
|
|================== | 25%
|
|=========================================== | 62%
|
|======================================================================| 100%
print(model)
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1655040503280_500
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 100 24848 4
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 4 4.00000 12 16 15.17000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.1076065
## RMSE: 0.3280343
## LogLoss: 0.3600893
## Mean Per-Class Error: 0.1302043
## AUC: 0.9464602
## AUCPR: 0.9406235
## Gini: 0.8929204
## R^2: 0.5657448
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 4295 621 0.126322 =621/4916
## 1 546 3526 0.134086 =546/4072
## Totals 4841 4147 0.129840 =1167/8988
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.454543 0.858012 207
## 2 max f2 0.303734 0.899849 270
## 3 max f0point5 0.582481 0.882476 160
## 4 max accuracy 0.464421 0.870717 203
## 5 max precision 0.990029 1.000000 0
## 6 max recall 0.063184 1.000000 381
## 7 max specificity 0.990029 1.000000 0
## 8 max absolute_mcc 0.462354 0.739213 204
## 9 max min_per_class_accuracy 0.449299 0.868861 209
## 10 max mean_per_class_accuracy 0.454543 0.869796 207
## 11 max tns 0.990029 4916.000000 0
## 12 max fns 0.990029 4071.000000 0
## 13 max fps 0.014653 4916.000000 399
## 14 max tps 0.063184 4072.000000 381
## 15 max tnr 0.990029 1.000000 0
## 16 max fnr 0.990029 0.999754 0
## 17 max fpr 0.014653 1.000000 399
## 18 max tpr 0.063184 1.000000 381
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.1200593
## RMSE: 0.3464957
## LogLoss: 0.3894168
## Mean Per-Class Error: 0.1585421
## AUC: 0.9239359
## AUCPR: 0.9173075
## Gini: 0.8478718
## R^2: 0.5157124
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1328 307 0.187768 =307/1635
## 1 176 1185 0.129317 =176/1361
## Totals 1504 1492 0.161215 =483/2996
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.425963 0.830705 225
## 2 max f2 0.317548 0.887594 268
## 3 max f0point5 0.606576 0.850985 152
## 4 max accuracy 0.516521 0.846462 189
## 5 max precision 0.978424 1.000000 0
## 6 max recall 0.084742 1.000000 373
## 7 max specificity 0.978424 1.000000 0
## 8 max absolute_mcc 0.479757 0.690191 204
## 9 max min_per_class_accuracy 0.458183 0.839089 213
## 10 max mean_per_class_accuracy 0.479757 0.844921 204
## 11 max tns 0.978424 1635.000000 0
## 12 max fns 0.978424 1358.000000 0
## 13 max fps 0.012874 1635.000000 399
## 14 max tps 0.084742 1361.000000 373
## 15 max tnr 0.978424 1.000000 0
## 16 max fnr 0.978424 0.997796 0
## 17 max fpr 0.012874 1.000000 399
## 18 max tpr 0.084742 1.000000 373
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
10 What validation performance can you get with
h2o::h2o.xgboost
instead?
11 Using the best performing of the two models calculate the test performance
#perf <- h2o::h2o.performance(model = , newdata = h2o::as.h2o(eeg_test))
#print(perf)
12 Describe 2 possible alternative modelling approaches we discussed in class but haven’t explored in this notebook.